Exploratory & Statistical Data Analysis on Marketing_analytics

I am exploring this dataset to get ideas for EDA in the marketing domain, and to extract insights and run statistical analyses on marketing topics.

Defining columns:

  1. Year_Birth : Customer's birth year
  2. Education: Customer's education level
  3. Marital_Status: Customer's marital status
  4. Income: Customer's yearly household income
  5. Kidhome: Number of children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Dt_Customer: Date of customer's enrollment with the company
  8. Recency: Number of days since customer's last purchase
  9. MntWines: Amount spent on wine in the last 2 years (Mnt = amount spent on the product in the last 2 years)
  10. NumDealsPurchases: Number of purchases made with a discount (each Num column represents a number of purchases)
  11. AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  12. AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise (each Accepted column records whether the offer was accepted in the campaign with that number)
  13. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  14. Complain: 1 if customer complained in the last 2 years, 0 otherwise
  15. Country: Customer's location
  16. ID: Customer's unique identifier

We have a total of 28 variables. The list above groups columns with common name patterns under one representative entry (for example, MntWines stands in for all the Mnt columns), so it is shorter than the full column list.

Now, we need to clean the 'Income' column: remove the '$' and ',' characters from the values and convert the column to float.
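A minimal sketch of this cleanup, assuming pandas and a few made-up rows in place of the real data. The exact column name and string layout in the raw file may differ:

```python
import pandas as pd

# Toy rows standing in for the real data; the values are invented for illustration.
df = pd.DataFrame({"Income": ["$84,835.00 ", "$57,091.00 ", None]})

# Remove '$' and ',' as literal characters, trim whitespace, then cast to float.
# Missing values stay missing (NaN) after the cast.
df["Income"] = (
    df["Income"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype(float)
)
print(df["Income"].tolist())
```

Passing regex=False matters here because '$' is a special character in regular expressions.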

Multiple features contain outliers (see boxplots below), but the only ones that likely indicate data entry errors are rows with Year_Birth <= 1900, so we remove those rows.
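The filter can be sketched as a boolean mask, assuming pandas and a toy column of birth years:

```python
import pandas as pd

# Toy birth years; the first two mimic implausible entries.
df = pd.DataFrame({"Year_Birth": [1893, 1900, 1969, 1985]})

# Keep only rows with a birth year after 1900.
df = df[df["Year_Birth"] > 1900].reset_index(drop=True)
print(df["Year_Birth"].tolist())
```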

We found that the column 'Dt_Customer' should be converted to datetime format.
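A short sketch of the conversion with pandas. The month/day/two-digit-year format used here is an assumption for illustration; check the raw strings before fixing a format:

```python
import pandas as pd

# Toy date strings; the "%m/%d/%y" layout is assumed, not confirmed by the source.
df = pd.DataFrame({"Dt_Customer": ["6/16/14", "5/13/13"]})

# Parse the strings into datetime64 values.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%m/%d/%y")
print(df["Dt_Customer"].dt.year.tolist())
```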

The following new features can be engineered from the existing columns:

  1. The total number of dependents in the home ('Dependents') can be engineered from the sum of 'Kidhome' and 'Teenhome'
  2. The year of becoming a customer ('Year_Customer') can be engineered from 'Dt_Customer'
  3. The total amount spent ('TotalMnt') can be engineered from the sum of all features containing the keyword 'Mnt'
  4. The total purchases ('TotalPurchases') can be engineered from the sum of all features containing the keyword 'Purchases'
  5. The total number of campaigns accepted ('TotalCampaignsAcc') can be engineered from the sum of all features containing the keyword 'Cmp', plus 'Response' (the latest campaign)
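The five engineered features can be sketched as follows, assuming pandas and a toy frame that holds only a subset of the real columns. Each column list is built before its total is added, so a total is never summed into itself:

```python
import pandas as pd

# Toy frame with a subset of the columns; values are invented for illustration.
df = pd.DataFrame({
    "Kidhome": [1, 0],
    "Teenhome": [0, 2],
    "Dt_Customer": pd.to_datetime(["2014-06-16", "2013-05-13"]),
    "MntWines": [100, 20],
    "MntFishProducts": [10, 5],
    "NumWebPurchases": [4, 1],
    "NumStorePurchases": [6, 2],
    "AcceptedCmp3": [0, 1],
    "AcceptedCmp4": [0, 0],
    "Response": [1, 1],
})

df["Dependents"] = df["Kidhome"] + df["Teenhome"]
df["Year_Customer"] = df["Dt_Customer"].dt.year

# Select columns by keyword, then sum across each row.
mnt_cols = [c for c in df.columns if "Mnt" in c]
df["TotalMnt"] = df[mnt_cols].sum(axis=1)

purchase_cols = [c for c in df.columns if "Purchases" in c]
df["TotalPurchases"] = df[purchase_cols].sum(axis=1)

campaign_cols = [c for c in df.columns if "Cmp" in c] + ["Response"]
df["TotalCampaignsAcc"] = df[campaign_cols].sum(axis=1)

print(df[["Dependents", "Year_Customer", "TotalMnt",
          "TotalPurchases", "TotalCampaignsAcc"]])
```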

To identify patterns, we will first look at feature correlations. In the clustered heatmap below, positive correlations between features appear red, negative correlations appear blue, and no correlation appears grey.

The number of web visits in the last month is not positively correlated with the number of web purchases. Instead, it is positively correlated with the number of deals purchases (the fitted line has a positive slope), suggesting that deals are an effective way of stimulating purchases on the website.

STATISTICAL ANALYSIS

We will fit a Linear Regression model with 'NumStorePurchases' as the target variable, and then use machine learning techniques to find out which features predict the number of store purchases.

Dropping uninformative Features:

ID is dropped because it is unique to each customer and carries no predictive information.

Dt_Customer is dropped because we will use the Year_Customer engineered variable.

We will one-hot encode the categorical features.

From the above data, we can see that the one-hot encoding is working as expected.
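A minimal sketch of the encoding step, assuming pandas and a toy frame. The list of categorical columns is written out by hand here and is an assumption; the real data may have more:

```python
import pandas as pd

# Toy frame; values are invented for illustration.
df = pd.DataFrame({
    "Education": ["PhD", "Master", "PhD"],
    "Marital_Status": ["Married", "Single", "Divorced"],
    "Income": [84835.0, 57091.0, 67267.0],
})

# Columns treated as categorical (assumed for this sketch).
cat_cols = ["Education", "Marital_Status"]

# Each category becomes its own 0/1 column; the original columns are dropped.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)
print(encoded.columns.tolist())
```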

We will fit a Linear Regression model to our dataset. 70% of the data goes into the training set and 30% into the test set.

We will evaluate the model with RMSE (root mean squared error) on the test data.
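The split, fit, and RMSE steps can be sketched with scikit-learn. The synthetic X and y below are stand-ins for the real feature matrix and target, and the random_state value is arbitrary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus a little noise (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

# RMSE is the square root of the mean squared prediction error on the test set.
errors = y_test - model.predict(X_test)
rmse = float(np.sqrt(np.mean(errors ** 2)))
print("RMSE:", rmse)
print("Median of target:", float(np.median(y_test)))
```

Printing the median of the target next to the RMSE gives the scale against which the error is judged.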

The RMSE is extremely small compared to the median value of the target variable, indicating good model predictions.

Identifying significant features that affect the number of store purchases, using permutation importance:

Significant Features: 'TotalPurchases', 'NumCatalogPurchases', 'NumWebPurchases', 'NumDealsPurchases'
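The source does not say which library computed the permutation importance. One option is scikit-learn's permutation_importance, sketched here on synthetic data with hypothetical feature names:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
feature_names = ["strong", "weak", "noise"]  # hypothetical names

# For brevity the model is scored on the data it was fit on;
# a held-out test set is the better choice in practice.
model = LinearRegression().fit(X, y)

# Shuffle each column in turn and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

order = np.argsort(result.importances_mean)[::-1]
ranking = [feature_names[i] for i in order]
print(ranking)
```

A feature whose shuffling barely changes the score contributes little to the predictions.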

Explore the directionality of these effects, using SHAP values:

From the above SHAP plot, we can see that:

  1. NumStorePurchases increases when TotalPurchases increases.
  2. NumStorePurchases decreases when NumCatalogPurchases, NumWebPurchases, or NumDealsPurchases increases.

Inference from the above bar chart:

  1. Spain has the highest number of purchases.
  2. The US ranks second to last, so it does not fare well in total number of purchases compared with the rest of the world.

Inference from the above bar chart:

  1. Spain has spent the highest total amount on purchases.
  2. The US ranks second to last, so it does not fare well in total amount spent on purchases compared with the rest of the world.

We can hypothesize that people who spent an above-average amount on gold in the last 2 years also make more in-store purchases. We will check this with seaborn's lmplot on the two columns MntGoldProds and NumStorePurchases.

There is a positive relationship but we have to find out whether it is statistically significant.

MntGoldProds contains outliers, so we use Kendall correlation analysis (a non-parametric test).

Yes, there is a significant positive correlation between MntGoldProds and NumStorePurchases.
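A minimal sketch of the Kendall test, assuming SciPy and two short made-up lists in place of the real columns:

```python
from scipy.stats import kendalltau

# Invented values; the last gold amount mimics an outlier.
gold = [5, 12, 20, 33, 41, 58, 70, 250]
store = [2, 3, 3, 5, 6, 8, 9, 13]

# Kendall's tau is rank-based, so the outlier's size does not matter, only its rank.
tau, p_value = kendalltau(gold, store)
print("tau:", tau, "p-value:", p_value)
```

A positive tau with a p-value below the chosen threshold (0.05 is a common choice) would support a significant positive relationship.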

Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish?

We will compare 'MntFishProducts' between married PhD candidates and all other customers.

Married PhD candidates spent less on fish products than other customers.

As with the NumStorePurchases Linear Regression model above, we will fit another Linear Regression model with MntFishProducts as the target variable, and then use machine learning techniques to find out which features predict the amount spent on fish.

The RMSE is much smaller than the typical value of the target variable, so the model's predictions are very good.

Identify features that significantly affect the amount spent on fish, using permutation importance:

Significant Features: 'TotalMnt', 'MntWines', 'MntMeatProducts', 'MntGoldProds', 'MntSweetProducts', 'MntFruits'

Findings:

  1. The amount spent on fish increases with a higher total amount spent (TotalMnt).
  2. The amount spent on fish decreases with a higher amount spent on Wines, Meat, Gold, Fruits, and Sweet products.

So, customers who spend more on fish are likely to spend less on other products such as Wines, Meat, Gold, Fruits, and Sweet products.

Finding a significant relationship between geographical region and the success of a campaign:

  1. Campaign acceptance rates are low overall.
  2. The campaign with the highest overall acceptance rate is the most recent campaign (column name: Response).
  3. The country with the highest acceptance rate in any campaign is Mexico.

We need to test whether the effect of region on campaign success is statistically significant.

Findings: The regional differences in advertising campaign success are statistically significant.
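The source does not name the test behind this finding. One common choice for a categorical region against a binary outcome is a chi-square test of independence, sketched here with SciPy on made-up counts for two hypothetical country codes:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data: 100 customers per country with different acceptance rates.
df = pd.DataFrame({
    "Country": ["SP"] * 100 + ["US"] * 100,
    "Response": [1] * 40 + [0] * 60 + [1] * 10 + [0] * 90,
})

# Contingency table of country by accepted / not accepted.
table = pd.crosstab(df["Country"], df["Response"])

# Test whether acceptance is independent of country.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
```

The same pattern would be repeated per campaign column if each campaign is tested separately.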

Data Visualization

We will now plot the overall acceptance rate of each marketing campaign.

The plot uses 'Index' (the campaign) on one axis and 'percent' (the acceptance rate) on the other.

We conclude that the most successful campaign is the most recent one (column: 'Response').
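A sketch of how such a two-column table could be built with pandas, assuming toy 0/1 campaign columns. The column names 'index' and 'percent' here are produced by reset_index and rename; the original notebook may have built them differently:

```python
import pandas as pd

# Toy 0/1 acceptance flags; values are invented for illustration.
df = pd.DataFrame({
    "AcceptedCmp3": [0, 1, 0, 0],
    "AcceptedCmp4": [0, 0, 0, 0],
    "Response": [1, 1, 0, 0],
})

# The mean of a 0/1 column is its acceptance rate; multiply by 100 for percent.
rates = (
    (df.mean() * 100)
    .rename("percent")
    .reset_index()  # campaign names move into a column called 'index'
    .sort_values("percent", ascending=False)
    .reset_index(drop=True)
)
print(rates)
```

The resulting frame can be passed to a bar plot, for example seaborn's barplot with x="index" and y="percent".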

We conclude that the average customer:

  1. Was born in 1969
  2. Became a customer in 2013
  3. Has an income of around $52,000
  4. Has 1 dependent (roughly equally split between kids and teens)
  5. Made a purchase from our company in the last 49 days

We conclude that:

The average customer spent...

  1. $25-50 on Fruits, Sweets, Fish, or Gold products
  2. Over $160 on Meat products
  3. Over $300 on Wines
  4. Over $2,400 in total

Products performing best: Wines, followed by Meat products

Let's find out which channels are underperforming.

We conclude that the average customer:

  1. Accepted fewer than 1 advertising campaign
  2. Made 2 deals purchases, 2 catalog purchases, 4 web purchases, and 5 store purchases
  3. Averaged 14 total purchases
  4. Visited the website 5 times

Underperforming channels:

  1. Advertising campaigns
  2. Deals and catalog purchases